Project 2 - Page Rank
Assigned: Jan 31, 2008. Due: Monday, Feb 11, 2008, 11:59 PM
Partners: You may work alone or with a partner. No groups bigger than 2 please.
Turnin: Your code and writeup before the due date. Turn in
here
Introduction:
Page Rank is an algorithm that operates on a link graph, assigning to each node of the graph a numerical weight.
Its purpose is to associate with every node in the graph a relative 'importance.' Page Rank is an iterative
algorithm that runs on a graph, with its nodes initialized to a certain set of values, until the values converge
(i.e., stop changing).
The relevant citation is as follows:
We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping
factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the
next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A
is given as follows:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages'
PageRanks will be one.
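For intuition, here is a small worked example on a made-up three-page graph. Suppose the only links are
A -> B, B -> A, B -> C, and C -> A, so C(A) = 1, C(B) = 2, and C(C) = 1. With every rank initialized to 1
and d = 0.85, the first update for A is:
PR(A) = (1 - 0.85) + 0.85 * (PR(B)/C(B) + PR(C)/C(C)) = 0.15 + 0.85 * (1/2 + 1/1) = 1.425
Applying such updates to every page, over and over, is exactly what your iterations will do until the values
stop changing.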
For more information on the Page Rank algorithm you will need to read the paper:
http://infolab.stanford.edu/~backrub/google.html (Section 2.1)
Your Goal:
Implement PageRank using MapReduce, turn Wikipedia into a giant graph, run PageRank on said graph, run
it several more times (ideally until the values converge), and return (in a human-readable way) the
PageRank of all the articles. You will also need to implement at least one extension.
Data Set:
As your final data set, you will use a crawl of Wikipedia articles. You should already be familiar with it from
project 1, but here is a refresher. The data set is located on the DFS at
/shared/wikipedia (roughly 12 GB of text in 193 files, each roughly 64 MB in size).
Each Wikipedia article is on its own line. A typical line of the data set (article) will start as follows:
<page> <title>Critical theory</title>
... (snip) ...
<text xml:space="preserve">In the [[humanities]] and [[social sciences]], '''critical theory'''
has two quite different meanings with different origins and histories, one originating in [[social theory]]
and the other in [[literary criticism]]
... (snip)
Notice that the title of the article is delimited by an XML title tag, and the text of the article is delimited
by an XML text tag. Links in the article to other Wikipedia articles are delimited by "[[" and "]]". Also,
certain links will look like this: "[[Longer Real Name|Short Name]]". These links appear in the article as "Short
Name" but link to an article titled "Longer Real Name".
A Possible Outline:
- graphBuilder - Construct a link graph out of wikipedia. Make sure this is space efficient - what kind
of graph representation can you use? You may also find SequenceFileOutputFormat useful for this.
- pageRankIter - One iteration updates the page rank values for your page rank graph. (A rough sketch of what
one iteration might look like appears right after this list.)
- pageRankViewer - Returns the page ranks for Wikipedia articles in a human-readable way. (For example,
converting from SequenceFileInputFormat to TextOutputFormat and sorting on page rank.)
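For concreteness, here is a rough sketch of what pageRankIter might look like in the org.apache.hadoop.mapred
API. It assumes graphBuilder emitted records of the (made-up) form "title<TAB>rank<TAB>comma-separated
outlinks", and uses a "|" prefix to pass the adjacency list through to the reducer; both choices are
illustrations, not requirements, so adapt them to whatever graph representation you pick.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class PageRankIter {

    public static class Map extends MapReduceBase
            implements Mapper<Text, Text, Text, Text> {
        public void map(Text title, Text rankAndLinks,
                        OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            String[] parts = rankAndLinks.toString().split("\t");
            double rank = Double.parseDouble(parts[0]);
            String linkList = parts.length > 1 ? parts[1] : "";
            // Pass the graph structure through so the reducer can rebuild this node.
            out.collect(title, new Text("|" + linkList));
            // Give each outlink its share of this page's current rank.
            if (linkList.length() > 0) {
                String[] links = linkList.split(",");
                for (String link : links) {
                    out.collect(new Text(link),
                                new Text(Double.toString(rank / links.length)));
                }
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        private static final double D = 0.85; // damping factor

        public void reduce(Text title, Iterator<Text> values,
                           OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            double sum = 0;
            String linkList = "";
            while (values.hasNext()) {
                String v = values.next().toString();
                if (v.startsWith("|")) {
                    linkList = v.substring(1); // the pass-through adjacency list
                } else {
                    sum += Double.parseDouble(v); // a rank contribution
                }
            }
            double rank = (1 - D) + D * sum;
            out.collect(title, new Text(rank + "\t" + linkList));
        }
    }
}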
Extensions:
You must implement at least one extension, either from the list below or one of your own, and you should feel
free to implement more!
- Link Hack Wikipedia. Use your data to subtly edit important Wikipedia pages to increase the
pagerank of a pet cause. Tell us what you did and why it should work.
- Implement variably weighted links. Explain your heuristic and convince us that it works.
- Combine project 1 (Inverted Indexer) and your Page Ranker into a rudimentary search engine over wikipedia.
Explain how you used the page rank when retrieving search results.
Hints:
- You can run a Hadoop job locally for testing purposes:
- In Windows: Make sure you have Cygwin installed. Then, make sure
that 'CYGWIN_INSTALL_DIR\bin' is in your PATH environment variable, where CYGWIN_INSTALL_DIR is an actual directory like C:\cygwin. (If you need help with this, see
here)
- Make sure the directories pointed to by your source and destination paths are valid. Note that relative paths
now resolve against your working directory, so if you are starting from Eclipse, the working directory is your
project's root directory.
- Then, just run the main method of your job's driver.
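For example, a minimal driver for a local test run might look like this (a sketch only: it assumes the
PageRankIter classes sketched in the outline section, uses placeholder paths "input" and "output" relative to
the working directory, and the exact setup calls vary a little across Hadoop versions):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class PageRankDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(PageRankDriver.class);
        conf.setJobName("pagerank-local-test");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(PageRankIter.Map.class);
        conf.setReducerClass(PageRankIter.Reduce.class);
        // Splits each input line at the first tab: key = title, value = rank + links.
        conf.setInputFormat(KeyValueTextInputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path("input"));
        FileOutputFormat.setOutputPath(conf, new Path("output"));
        JobClient.runJob(conf); // runs in-process when no cluster is configured
    }
}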
- Get extra use out of your PageRank reduce method by also setting it as your combiner class (why is this ok?)
- You can surface more information during your mapper and reducer execution by using the 'reporter.setStatus()'
method in your mapper and reducer. This makes your mapper/reducer's status visible on the web interface.
- You can also set the name of your job (which will appear on the web interface) by calling 'conf.setJobName()' in your
driver. This can be useful to keep track of your Page Rank iterations.
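For instance, building on the driver sketch above, a hypothetical iteration loop (e.g., a method inside
PageRankDriver) might name and chain jobs like this:

public static void runIterations(int numIterations) throws Exception {
    for (int i = 1; i <= numIterations; i++) {
        JobConf conf = new JobConf(PageRankDriver.class);
        conf.setJobName("pagerank-iter-" + i); // tells iterations apart on the web interface
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(PageRankIter.Map.class);
        conf.setReducerClass(PageRankIter.Reduce.class);
        conf.setInputFormat(KeyValueTextInputFormat.class);
        // Each pass reads the previous pass's output; "ranks-0" would be graphBuilder's output.
        FileInputFormat.setInputPaths(conf, new Path("ranks-" + (i - 1)));
        FileOutputFormat.setOutputPath(conf, new Path("ranks-" + i));
        JobClient.runJob(conf);
    }
}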
- You can read directly from the DFS. This is especially useful if you combine it with passing parameters to your mappers
as described in the project 1 hints. The relevant class is 'org.apache.hadoop.fs.FileSystem'.
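Here is a minimal sketch of that, assuming a hypothetical file name part-00000 under /shared/wikipedia (the
actual file names may differ):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DfsReader {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/shared/wikipedia/part-00000"))));
        System.out.println(in.readLine()); // the first article in this file
        in.close();
    }
}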
- The Hadoop API documentation is found on the Hadoop website and can be very useful.
Quick Link.
- Read the Cluster Notes and use the Job Tracker!
- Start early!
Write-Up:
- How long do you think this project will take you to finish?
- How much time did you actually spend on the project?
- Acknowledge any assistance you received from anyone except assigned course readings and the
course staff.
- Explain why it's ok to use a combiner class with the PageRank algorithm.
- What are the ten pages with highest rank in the provided Wikipedia corpus?
- Describe any extensions you've implemented.